Reading the text

The first step is to load the text. For this example, we will load a work from Project Gutenberg.


In [18]:
fileName='book.txt'

Next, we remove everything that should not count as valid text. To do so, we define a function that lowercases each line and strips out the characters we do not want to count, such as punctuation.


In [19]:
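import re

def removePunctuation(text):
    # Lowercase, keep only a-z, 0-9 and spaces, then trim leading/trailing spaces.
    # Note: '|' inside a character class is a literal pipe, so it is left out here.
    return re.sub(r'[^a-z0-9 ]', '', text.lower()).strip()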
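As a quick sanity check, the sketch below applies the same kind of cleaning to a few made-up strings, without needing Spark. The name remove_punctuation and the sample lines are illustrative assumptions, not part of the original notebook.

```python
import re

def remove_punctuation(text):
    # Lowercase, drop everything except a-z, 0-9 and spaces, then trim the ends.
    return re.sub(r'[^a-z0-9 ]', '', text.lower()).strip()

# Hypothetical sample lines, chosen only to exercise the function.
samples = [
    'Hi, you!',
    ' No under_score!',
    ' *      Remove punctuation then spaces  * ',
]
cleaned = [remove_punctuation(s) for s in samples]
for line in cleaned:
    print(repr(line))
```

Trimming after the substitution, rather than before it, avoids leaving stray spaces at the ends when a line begins or ends with punctuation.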

Now we create the first RDD from the book's contents. The second argument to textFile, 8, sets the minimum number of partitions.


In [21]:
shakespeareRDD = (sc
                  .textFile(fileName, 8)
                  .map(removePunctuation))

In [22]:
shakespeareRDD.take(4)


Out[22]:
[u'the project gutenberg ebook of anecdotes of animals by unknown',
 u'',
 u'this ebook is for the use of anyone anywhere at no cost and with',
 u'almost no restrictions whatsoever  you may copy it give it away or']

In [23]:
print('\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(15)))


0: the project gutenberg ebook of anecdotes of animals by unknown
1: 
2: this ebook is for the use of anyone anywhere at no cost and with
3: almost no restrictions whatsoever  you may copy it give it away or
4: reuse it under the terms of the project gutenberg license included
5: with this ebook or online at wwwgutenbergorg
6: 
7: 
8: title anecdotes of animals
9: 
10: author unknown
11: 
12: illustrator percy j billinghvrst
13: 
14: release date may 11 2008 ebook 25428
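
For readers less familiar with zipWithIndex, it pairs each element with its position as (element, index). A rough local analogue, assuming a plain Python list of already cleaned lines (the sample lines below are made up for illustration), can be sketched with enumerate:

```python
# Hypothetical sample of cleaned lines; in the notebook these come from the RDD.
lines = ['title anecdotes of animals', '', 'author unknown']

# enumerate yields (index, line), whereas zipWithIndex yields (line, index).
numbered = ['{0}: {1}'.format(num, line) for num, line in enumerate(lines)]
print('\n'.join(numbered))
```

This only mirrors the numbering logic on a local list; it is not a substitute for the distributed RDD operation.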
